The aim of this capstone is to practice unsupervised learning methods and alternative models. For this purpose we used data from Turo.com, the biggest car-sharing platform in the USA. There are two problems that require unsupervised learning methods:
Location clustering. For location clustering we used the Latitude and Longitude associated with every listing in the data. We approached the dataset with k-means, DBSCAN, and hierarchical clustering, and used the silhouette score to judge how well each method performs.
Car category clustering. For this purpose we manually assigned a category to each car. We will try clustering with unsupervised techniques (k-means, hierarchical clustering, and DBSCAN) combined with dimensionality reduction such as PCA and UMAP, but most likely the best approach will be a supervised classifier. We will use a multi-layer perceptron neural network and a random forest to see which performs better.
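The cells below rely on a standard data-science stack; a minimal sketch of the imports assumed throughout (the `umap-learn` package is only needed for the UMAP cells):

```python
import time

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import metrics
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# import umap  # optional: only needed for the UMAP cells (pip install umap-learn)
```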
# Here is what our data looks like
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 45233 entries, 38 to 168108
Data columns (total 100 columns): Car ID, URL, Make, Model, Trim, Year, Color,
Transmission, Fuel Type, Number of seats, Number of doors, GPS, Convertible,
Status, Booking Instantly, City, State, ZIP Code, Country, Latitude, Longitude,
Owner ID, Owner, Price per day, Custom/Airport Delivery fee, Distance Included
(day/week/month), Booking Discounts (weekly/monthly), Fee for extra mile,
Registration date, Trip Count, Reviews number, Owner rate, Airport ride needed,
Business class, Vehicle protection, numberOfFaqs, regularAirportDeliveryFee,
minimumAgeInYearsToRent, numberOfFavorites, highValueVehicle, frequentlyBooked,
dateRangeRate, two sets of monthly Occupancy columns (Jan-Dec), and 30 entirely
empty columns "Unnamed: 70" through "Unnamed: 99" (0 non-null).
Many columns are sparsely populated, e.g. Trim (23387 non-null), Color (27235),
Owner rate (23704), and the occupancy columns (roughly 3000-19000 non-null
each) out of 45233 rows.
dtypes: bool(3), float64(68), int64(10), object(19)
memory usage: 33.9+ MB
At this stage we check the sanity of the data and make observations on the data types, missing values, and shape of the data.
# Remove the empty columns with indices 70 to 99
df.drop(df.columns[70:100], axis=1, inplace=True)
# Replace all remaining missing values with 0
df = df.fillna(0)
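Blanket-filling every gap with 0 is a strong assumption (a 0 in `Owner rate`, for instance, reads as a real rating), so it is worth inspecting the share of missing values per column first. A minimal sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df with one sparsely populated column
toy = pd.DataFrame({'Owner rate': [4.5, np.nan, 5.0],
                    'Trip Count': [10, 3, 7]})

# Fraction of missing values per column, worth checking before fillna(0)
missing_share = toy.isna().mean()
toy_filled = toy.fillna(0)  # the blanket fill used above
```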
The first problem we are solving is separating our listings by location. We would like to use k-means because our data has only two variables, both of which are distance measurements. K-means should be the best fit for this, but we will validate with other methods as well.
# Variables with the Latitude and Longitude (plus Car ID for a later join)
Z = df[['Car ID', 'Latitude', 'Longitude']].copy()
coords = Z[['Latitude', 'Longitude']]
K_clusters = range(1, 20)
kmeans_models = [KMeans(n_clusters=i) for i in K_clusters]
# score() returns the negative inertia; fit on both coordinates, not just one
score = [m.fit(coords).score(coords) for m in kmeans_models]
# Visualize the elbow curve over the number of clusters
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
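The elbow curve is one heuristic; since we rely on the silhouette score to compare methods later, the same score can also guide the choice of k directly. A sketch on synthetic coordinate blobs (not the Turo data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the (Latitude, Longitude) pairs
coords, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
    scores[k] = silhouette_score(coords, labels)
best_k = max(scores, key=scores.get)  # k with the highest silhouette
```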
# Let's define labels for location
kmeans = KMeans(n_clusters=12, init='k-means++')
Z['cluster_label'] = kmeans.fit_predict(Z[['Latitude', 'Longitude']])  # Compute k-means clustering
centers = kmeans.cluster_centers_   # Coordinates of cluster centers
labels = Z['cluster_label'].values  # Label of each point
#put them on the plot
Z.plot.scatter(x = 'Latitude', y = 'Longitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
#This is our silhouette score for k-means
print(metrics.silhouette_score(Z[Z.columns[1:3]], labels, metric='euclidean'))
0.6456396547584401
# Defining the DBSCAN clustering
dbscan_cluster_l = DBSCAN(eps=1, min_samples=5)
# Fit model
clusters = dbscan_cluster_l.fit_predict(Z[Z.columns[1:3]])
print(metrics.silhouette_score(Z[Z.columns[1:3]], clusters, metric='euclidean'))
0.0023567980682241657
Z.plot.scatter(x = 'Latitude', y = 'Longitude', c = clusters, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
Based on the metrics, this definitely won't be our choice.
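DBSCAN is very sensitive to `eps`; an `eps` of 1 (one full degree of latitude, roughly 111 km) lumps most of a metro area into one cluster, which helps explain the near-zero silhouette. A common heuristic is to sort each point's distance to its `min_samples`-th nearest neighbour and pick `eps` near the knee of that curve. A sketch on synthetic points (the quantile cut-off is an illustrative assumption, not a rule):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

pts, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.3, random_state=42)

# Distance from each point to its 5th nearest neighbour (matching min_samples=5)
nn = NearestNeighbors(n_neighbors=5).fit(pts)
dists, _ = nn.kneighbors(pts)
k_dist = np.sort(dists[:, -1])

# Rough knee estimate: a high quantile of the sorted k-distance curve
eps_guess = float(k_dist[int(0.95 * len(k_dist))])
```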
from sklearn.cluster import AgglomerativeClustering
# Defining the agglomerative clustering
agg_cluster = AgglomerativeClustering(linkage='complete',
affinity='cosine',
n_clusters=5)
# Fit model
clusters_h = agg_cluster.fit_predict(Z[Z.columns[1:3]])
Z.plot.scatter(x = 'Latitude', y = 'Longitude', c = clusters_h, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
print(metrics.silhouette_score(Z[Z.columns[1:3]], clusters_h, metric='euclidean'))
0.43172051126894256
We've got very good results with clustering, especially k-means. Let's keep this score and join the labels to our main dataset. We will use the result from k-means, where the score was 0.65.
### Adding location cluster into the dataframe
# Merge only the label, to avoid duplicating Latitude/Longitude as _x/_y columns
df = df.merge(Z[['Car ID', 'cluster_label']], on='Car ID')
df.head()
| Car ID | URL | Make | Model | Trim | Year | Color | Transmission | Fuel Type | Number of seats | ... | Category_y | Latitude_x | Longitude_x | cluster_label_y | Category_x | Category_code | Category_y | Latitude_y | Longitude_y | cluster_label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 92 | https://turo.com/rentals/cars/ca/san-francisco... | bmw | 3 series | 0 | 2004 | OTHER | A | 0 | 0.0 | ... | Intermediate | 37.771757 | -122.440658 | 1 | intermediate | 6 | Intermediate | 37.771757 | -122.440658 | 1 |
| 1 | 279 | https://turo.com/rentals/cars/ca/hillsborough/... | porsche | 911 | Carrera S | 2006 | GRAY | M | Gas | 4.0 | ... | Exotic | 37.578681 | -122.363776 | 1 | exotic | 5 | Exotic | 37.578681 | -122.363776 | 1 |
| 2 | 445 | https://turo.com/rentals/cars/ca/berkeley/toyo... | toyota | prius | 0 | 2010 | BLUE | A | 0 | 0.0 | ... | Economy | 37.857234 | -122.265831 | 1 | economy | 3 | Economy | 37.857234 | -122.265831 | 1 |
| 3 | 724 | https://turo.com/rentals/cars/ca/san-jose/toyo... | toyota | corolla | 0 | 2012 | BLACK | A | 0 | 0.0 | ... | Intermediate | 37.353274 | -121.892823 | 1 | intermediate | 6 | Intermediate | 37.353274 | -121.892823 | 1 |
| 4 | 972 | https://turo.com/rentals/cars/ca/santa-ana/inf... | infiniti | m35 | Sport | 2006 | BLACK | A | Gas | 5.0 | ... | Intermediate | 33.772218 | -117.891395 | 7 | intermediate | 6 | Intermediate | 33.772218 | -117.891395 | 7 |
5 rows × 84 columns
# Load the manually assigned categories and join them on Car ID
df1 = pd.read_csv('Category_DB.csv', encoding='ISO-8859-1')
df['for category'] = df['Make'] + df['Model']
df = pd.merge(df, df1[['Car ID', 'Category']], on='Car ID', how='left')
# Adjust some text inconsistencies
df.Make = df.Make.str.lower()
df.Model = df.Model.str.lower()
df.Category = df.Category.str.lower()
df = df[df['Category'].notnull()]
df.Category = df.Category.astype('category')
df['Category_code'] = df['Category'].cat.codes
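`cat.codes` assigns each category an integer based on the alphabetical order of the category labels, which is what `Category_code` holds. A small self-contained example:

```python
import pandas as pd

cats = pd.Series(['suv', 'economy', 'suv', 'exotic']).astype('category')
# Categories are sorted alphabetically: economy=0, exotic=1, suv=2
codes = cats.cat.codes
```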
df.head(10)
| Car ID | URL | Make | Model | Trim | Year | Color | Transmission | Fuel Type | Number of seats | ... | Make_toyota | Make_triumph | Make_vauxhall | Make_volkswagen | Make_volvo | Make_vw | Make_yugo | value_True | Trans_M | Price per day_win | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 92 | https://turo.com/rentals/cars/ca/san-francisco... | bmw | 3 series | 0 | 2004 | OTHER | A | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90.0 |
| 1 | 279 | https://turo.com/rentals/cars/ca/hillsborough/... | porsche | 911 | Carrera S | 2006 | GRAY | M | Gas | 4.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 180.0 |
| 2 | 445 | https://turo.com/rentals/cars/ca/berkeley/toyo... | toyota | prius | 0 | 2010 | BLUE | A | 0 | 0.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 80.0 |
| 3 | 724 | https://turo.com/rentals/cars/ca/san-jose/toyo... | toyota | corolla | 0 | 2012 | BLACK | A | 0 | 0.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 22.0 |
| 4 | 972 | https://turo.com/rentals/cars/ca/santa-ana/inf... | infiniti | m35 | Sport | 2006 | BLACK | A | Gas | 5.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13.0 |
| 5 | 1081 | https://turo.com/rentals/cars/ca/san-leandro/h... | honda | civic | 0 | 2009 | GRAY | A | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 37.0 |
| 6 | 1085 | https://turo.com/rentals/cars/ca/sunnyvale/hon... | honda | accord | 0 | 2010 | BLACK | A | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 81.0 |
| 7 | 1461 | https://turo.com/rentals/trucks/ca/berkeley/to... | toyota | tacoma | Base | 2006 | BLUE | A | Gas | 4.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 95.0 |
| 8 | 1745 | https://turo.com/rentals/cars/ca/los-angeles/f... | fisker | karma | Eco-Sport | 2012 | GRAY | A | Hybrid | 4.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 127.0 |
| 9 | 1759 | https://turo.com/rentals/cars/in/evansville/po... | porsche | 911 | Carrera | 2006 | BLACK | M | 0 | 4.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 137.0 |
10 rows × 167 columns
df.groupby("Category")["Category"].count()
Category
cargo van         25
convertible     2418
economy         6788
electric        1611
exotic          3901
intermediate   11157
minivan         1681
motorhome          5
premium         4924
premium suv     4697
retro            437
suv             5648
truck           1423
van              103
Name: Category, dtype: int64
# Let's see the distribution of category codes (Y is defined below)
Y.value_counts()/len(Y)
5     0.248940
2     0.151457
11    0.126021
8     0.109867
9     0.104802
4     0.087041
1     0.053952
6     0.037507
3     0.035945
12    0.031751
10    0.009751
13    0.002298
0     0.000558
7     0.000112
Name: Category_code, dtype: float64
# Build the dummy variables and collect their column names in one pass
dummy_specs = [('Convertible', 'Conv'), ('Fuel Type', 'fuel'), ('Make', 'Make'),
               ('highValueVehicle', 'value'), ('Transmission', 'Trans')]
dummy_column_names = []
for col, prefix in dummy_specs:
    dummies = pd.get_dummies(df[col], prefix=prefix, drop_first=True)
    df = pd.concat([df, dummies], axis=1)
    dummy_column_names += list(dummies.columns)
X = df[['Price per day', 'Year', 'Number of seats', 'Number of doors']+dummy_column_names]
# Let's keep one more feature set without the price variable
X1 = df[['Year', 'Number of seats', 'Number of doors']+dummy_column_names]
Y = df["Category_code"]
# Let's standardize the variables
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
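Standardization rescales every feature to zero mean and unit variance, so that `Price per day` (tens to hundreds) does not dominate `Number of doors` (2-5) in distance-based methods. A minimal check on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales
X_toy = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
X_toy_std = StandardScaler().fit_transform(X_toy)
```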
# Reduce to two components for visualization and time it
time_start = time.time()
X_pca = PCA(2).fit_transform(X_std)
print('PCA done! Time elapsed: {} seconds'.format(time.time()-time_start))
PCA done! Time elapsed: 0.07679605484008789 seconds
time_start = time.time()
umap_results = umap.UMAP(n_neighbors=5,
min_dist=0.3,
metric='correlation').fit_transform(X_std)
print('UMAP done! Time elapsed: {} seconds'.format(time.time()-time_start))
UMAP done! Time elapsed: 38.6958270072937 seconds
# UMAP takes much longer; let's see how the embedding looks
plt.figure(figsize=(10,5))
plt.scatter(umap_results[:, 0], umap_results[:, 1])
plt.xticks([])
plt.yticks([])
plt.axis('off')
plt.show()
This isn't good performance from a clustering perspective, plus the computation time is too high.
umap_results7 = umap.UMAP(n_neighbors=7,
min_dist=0.3,
metric='correlation').fit_transform(X_std)
plt.figure(figsize=(10,5))
plt.scatter(umap_results7[:, 0], umap_results7[:, 1])
plt.xticks([])
plt.yticks([])
plt.axis('off')
plt.show()
# Calculate predicted values.
y_pred = KMeans(n_clusters=9, random_state=123).fit_predict(X_std)
# Plot the solution.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred)
plt.show()
# Unsupervised validation via the silhouette score (no labels needed):
labels = KMeans(n_clusters=7, random_state=123).fit_predict(X_std)
print(metrics.silhouette_score(X_pca, labels, metric='euclidean'))
0.13767906054356704
# The score is very low, since our goal is to be closer to 1. Let's try other techniques
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
X_pca = PCA(2).fit_transform(X_std)
# Defining the DBSCAN clustering
dbscan_cluster = DBSCAN(eps=1, min_samples=5)
# Fit model
clusters_db_pc = dbscan_cluster.fit_predict(X_std)
#plot the results
pca = PCA(n_components=2).fit_transform(X_std)
plt.figure(figsize=(10,5))
colours = 'rbg'
for i in range(pca.shape[0]):
    plt.text(pca[i, 0], pca[i, 1], str(clusters_db_pc[i]),
             fontdict={'weight': 'bold', 'size': 50})
plt.xticks([])
plt.yticks([])
plt.axis('off')
plt.show()
print(metrics.silhouette_score(X_std, clusters_db_pc, metric='euclidean'))
0.8957001493471506
# The score is much better. Let's try with UMAP dimensionality reduction
# Defining the DBSCAN clustering
dbscan_cluster = DBSCAN(eps=1, min_samples=5)
# Fit model
clusters_umap = dbscan_cluster.fit_predict(umap_results7)
print(metrics.silhouette_score(X_std, clusters_umap, metric='euclidean'))
-0.529524664838877
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
X_pca = PCA(2).fit_transform(X_std)
from sklearn.cluster import AgglomerativeClustering
# Defining the agglomerative clustering
agg_cluster = AgglomerativeClustering(linkage='complete',
affinity='cosine',
n_clusters=3)
# Fit model
clusters_h = agg_cluster.fit_predict(X_pca)
pca = PCA(n_components=2).fit_transform(X_std)
plt.figure(figsize=(10,5))
colours = 'rbg'
for i in range(pca.shape[0]):
    plt.text(pca[i, 0], pca[i, 1], str(clusters_h[i]),
             fontdict={'weight': 'bold', 'size': 50})
plt.xticks([])
plt.yticks([])
plt.axis('off')
plt.show()
print(metrics.silhouette_score(X_std, clusters_h, metric='euclidean'))
0.07522371146298965
# Repeat agglomerative clustering on the UMAP embedding (this fit was missing)
clusters_h_umap = agg_cluster.fit_predict(umap_results7)
print(metrics.silhouette_score(X_std, clusters_h_umap, metric='euclidean'))
-0.0018369254175829987
This score is very low, so we will skip this approach.
During the unsupervised clustering process we were unable to find a clear solution to our problem. Although, based on the silhouette score, the best result came from DBSCAN clustering on PCA-reduced data, we still can't rely on it. We need a more sophisticated approach, since our variables contain mixed types of data (numeric and categorical).
In this case we need to move forward and try to build a classifier with a neural network, so that we can assign a category to every new car added to the system.
The power of neural networks comes from their ability to learn representations in the training data and relate them to the output variable you want to predict. We will try an MLP model as well as a Random Forest Classifier and compare their performance.
# Import the model.
from sklearn.neural_network import MLPClassifier
# Establish and fit the model, with a single, 1000 perceptron layer.
mlp = MLPClassifier(hidden_layer_sizes=(1000,))
mlp.fit(X, Y)
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(1000,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=None, shuffle=True, solver='adam', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
mlp.score(X, Y)
0.6286090410103083
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)
array([0.47391304, 0.54466377, 0.49665328, 0.6543141 , 0.6141996 ])
Let's try it with the standardized X values, X_std.
# Import the model.
from sklearn.neural_network import MLPClassifier
# Establish and fit the model, with a single, 1000 perceptron layer.
mlp = MLPClassifier(hidden_layer_sizes=(1000,))
mlp.fit(X_std, Y)
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(1000,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=None, shuffle=True, solver='adam', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
mlp.score(X_std, Y)
0.7390334240706858
This is definitely much better: the score improved by about 10 percentage points when the data is standardized.
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X_std, Y, cv=5)
array([0.59253066, 0.63521802, 0.68987059, 0.72575064, 0.69591427])
Let's try another classifier to see if our results can go higher.
from sklearn import ensemble
from sklearn.model_selection import cross_val_score
rfc = ensemble.RandomForestClassifier()
cross_val_score(rfc, X, Y, cv=5)
array([0.56109253, 0.6016505 , 0.65896921, 0.67496372, 0.65539183])
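To compare the two classifiers on equal footing, we can average the cross-validation folds printed above: roughly 0.67 for the MLP (on standardized features) versus roughly 0.63 for the random forest.

```python
import numpy as np

# The CV arrays printed above for the MLP (on X_std) and the random forest (on X)
mlp_cv = np.array([0.59253066, 0.63521802, 0.68987059, 0.72575064, 0.69591427])
rfc_cv = np.array([0.56109253, 0.60165050, 0.65896921, 0.67496372, 0.65539183])

mlp_mean, rfc_mean = mlp_cv.mean(), rfc_cv.mean()
```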
These numbers are good, but the best performance came from the MLP on standardized features.
In this capstone we tried to solve the problem of clustering the data. Our original raw data has variables that should be clustered in order to make more precise predictions and to reduce noise and overfitting. Unsupervised clustering is mainly a process of grouping objects into bins, while supervised models predict an outcome. Two problems were addressed:
1. Location clustering: Unsupervised learning worked well for clustering the numerical latitude/longitude information (we got a silhouette score of about 0.65 with k-means).
2. Car category clustering: We tried several unsupervised learning techniques, but none of them work well for hybrid variables. We achieved a fairly high silhouette score with DBSCAN (about 0.90), but it does not agree well with the labeled data. We tried other methods as well. K-means and hierarchical clustering are good at finding circular (or convex) clusters, which makes them great tools for identifying well-separated clusters; unfortunately, they are not good at identifying clusters that are poorly separated or have non-convex shapes, such as rings inside rings. In any case, this problem is better solved with a supervised model, since more mixed data with non-linear relationships is involved. For this purpose we used a supervised neural network classifier that gave us a score of around 0.70. We checked the models with cross-validation to make sure we are not overfitting. These findings are extremely useful for developing an investment tool to predict potential earnings from sharing cars on the platform. We will continue working on this dataset to increase the reliability of our calculations.